\(^1\) Plateau technique de biologie moléculaire et bioinformatique, INCI - LNCA, Strasbourg

\(^2\) Équipe Modulation épigénétique des processus neurodégénératifs, LNCA, Strasbourg

1 - Log in to Galaxy

  • Go to the Galaxy France website.
  • Log in to your account. If you do not have an account, register first.

2 - History

2.1 - Create a New History

  • Create a new history.

2.2 - Rename the History to “DNA-seq data analysis”

  • Click “Unnamed history” (or the pencil icon next to the history name) at the top of the history panel, enter “DNA-seq data analysis”, and click Save.

3 - Importing Files from Your Computer to Galaxy

  • Download the file sample.bed.gz from this link and upload it to Galaxy.
    • Genome: Mouse (mm9)
    • Format: bed
  • Ensure you are in the “DNA-seq data analysis” history (switch to it if needed).

3.1 - Upload Steps:

  1. Download the file sample.bed.gz.
  2. Upload it into Galaxy under the “DNA-seq data analysis” history:
    • Click on Upload Data.
    • Drag and drop the file into the upload window.
    • Set:
      • Genome: Mouse (mm9)
      • Format: bed
    • Click Start.
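
For readers unfamiliar with the bed format selected above, here is a minimal sketch of how one BED line is structured: tab-separated columns, of which the first three (chromosome, start, end) are mandatory, with a 0-based start and an exclusive end. The names `BedInterval` and `parse_bed_line` and the example line are hypothetical illustrations, not part of Galaxy or of sample.bed.gz.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive

    @property
    def length(self) -> int:
        # Half-open coordinates make the length a simple difference.
        return self.end - self.start


def parse_bed_line(line: str) -> BedInterval:
    """Parse the three mandatory BED columns; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 tab-separated columns: {line!r}")
    return BedInterval(fields[0], int(fields[1]), int(fields[2]))


# Hypothetical example line, not taken from sample.bed.gz.
interval = parse_bed_line("chr1\t100\t250\tfeature_A")
print(interval.chrom, interval.start, interval.end, interval.length)  # chr1 100 250 150
```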

4 - Remove a Dataset

  • Remove the sample.bed.gz dataset from your history by clicking the delete button.

  • Verify that your history is empty, then purge the deleted dataset:
    • Click Show deleted at the top of the history panel to display the deleted dataset.
    • Permanently delete the file by clicking Purge All Deleted Content.
    • Return to the active view by clicking Show active.

5 - Running a Tool

  • Download the files CRN-107_11-R1.fastq.gz and CRN-107_11-R2.fastq.gz from this link.
  • Import them into your “DNA-seq data analysis” history:
    • Genome: Human Feb. 2009 (GRCh37/hg19) (hg19)
    • Format: <auto detect>

5.1 - Upload Steps:

  1. Download both fastq files.
  2. Upload them into Galaxy:
    • Click Upload Data.
    • Drag and drop CRN-107_11-R1.fastq.gz and CRN-107_11-R2.fastq.gz.
    • Set the genome for both datasets to Human Feb. 2009 (GRCh37/hg19) (hg19).
    • Click Start.

5.2 - Run the tool:

  • Use the FastQC Read Quality Reports tool to analyze the quality of the datasets:
    • Input files: CRN-107_11-R1.fastq.gz and CRN-107_11-R2.fastq.gz.
    • Parameters: Default settings.
  • What is the quality encoding of the two fastq files?
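
To make the notion of quality encoding concrete, the following sketch guesses the ASCII offset of a FASTQ file from the lowest quality character it contains. It is a simplified heuristic under stated assumptions (strict 4-line records; only the Phred+33 and Phred+64 families are distinguished), not the method FastQC itself uses. The function names and the example record are hypothetical.

```python
from typing import Iterable, Iterator


def quality_lines(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the 4th line of each 4-line FASTQ record (the quality string)."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def guess_quality_offset(fastq_lines: Iterable[str]) -> str:
    """Guess the quality offset from the lowest ASCII code observed."""
    lowest = min(
        (ord(ch) for qual in quality_lines(fastq_lines) for ch in qual),
        default=None,
    )
    if lowest is None:
        return "unknown (no quality values)"
    if lowest < 33:
        return "invalid (character below ASCII 33)"
    if lowest < 59:
        # Codes below 59 cannot occur in any offset-64 encoding.
        return "Phred+33"
    if lowest < 64:
        return "ambiguous (Phred+33 or Solexa+64)"
    return "Phred+64 (or uniformly high-quality Phred+33)"


# Hypothetical record, not taken from the CRN-107 files.
example = ["@read1\n", "ACGT\n", "+\n", "!5?I\n"]
print(guess_quality_offset(example))  # Phred+33
```

On a downloaded file, the same function could be fed the lines of `gzip.open(path, "rt")`; reading only the first few thousand records is usually enough for such a guess.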

6 - Running Tools Without a Workflow

Analyze the CRN-107 data from reads to variant annotation.

We will limit the analysis to targets located on chromosome 4. Download the file CaptureDesign_chr4.bed here and import it to Galaxy.

6.1 - Tools and Parameters:

  1. Map with BWA-MEM: map medium and long reads (> 100 bp) against a reference genome.
    • Using reference genome: Human (Homo sapiens) (b37): hg19
    • Single or Paired-end reads: Paired
    • Select first set of reads: CRN-107_11-R1.fastq.gz
    • Select second set of reads: CRN-107_11-R2.fastq.gz
    • Set read groups information? Set read groups (Picard style)
      • Read group identifier (ID): Auto-assign Yes
      • Read group sample name (SM): Auto-assign Yes
      • Library name (LB): Auto-assign Yes
      • Platform/technology used to produce the reads (PL): ILLUMINA
      • Platform unit (PU): HS026.2
      • Sequencing center that produced the read (CN): Genomeast
      • Description (DS): CRN-107
      • Predicted median insert size (PI): 250
      • Date that run was produced (DT): 2017-12-13
  2. MarkDuplicates: examine aligned records in BAM datasets to locate duplicate molecules.
    • Select SAM/BAM dataset or dataset collection: output of BWA-MEM
    • Select validation stringency: Silent
  3. FreeBayes: Bayesian genetic variant detector.
    • Choose FreeBayes version 1.3.9+galaxy1 (see example below for snpEff)
    • BAM or CRAM dataset: output (BAM) of MarkDuplicates
    • Using reference genome: hg19
    • Limit variant calling to a set of regions?
      • Limit by target file
        • Limit analysis to regions in this BED dataset: CaptureDesign_chr4.bed
  4. SnpEff eff: Annotate variants.
    • Choose SnpEff eff version 4.3+T.galaxy2 (see below)
    • Sequence changes (SNPs, MNPs, InDels): output of FreeBayes (VCF)
    • Input format: VCF
    • Output format: VCF (only if input is VCF)
    • Genome source: Locally installed snpEff database
      • Genome:
        • Homo sapiens: hg19
    • Upstream / Downstream length: No upstream / downstream intervals (0 bases)
  5. VCFtoTab-delimited: Convert VCF data to TAB-delimited format.
    • Select VCF dataset to convert: output of SnpEff
  • How many variants are called?
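
One way to cross-check the number of called variants outside Galaxy is to count the data lines of the VCF, that is, every non-empty line that does not start with `#`. The sketch below assumes an uncompressed VCF read as text; the function name and the example lines are hypothetical. Note that a record listing several alternate alleles is counted once here, so the figure may differ from a per-allele count.

```python
from typing import Iterable


def count_variant_records(vcf_lines: Iterable[str]) -> int:
    """Count VCF data lines: non-empty lines that do not start with '#'."""
    return sum(1 for line in vcf_lines if line.strip() and not line.startswith("#"))


# Hypothetical VCF content, not taken from the FreeBayes output.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr4\t1000\t.\tA\tG\t50\t.\t.",
    "chr4\t2000\t.\tT\tC,G\t60\t.\t.",
]
print(count_variant_records(example_vcf))  # 2
```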

7 - Create a Workflow Out of an Existing History

  1. Extract a workflow from the “DNA-seq data analysis” history:
    • Go to the history menu and select Extract Workflow.

  2. Rename the workflow to “DNA-seq data analysis”.

8 - Edit a Workflow with the Workflow Editor

  1. Open the DNA-seq data analysis workflow in the editor:
    • Go to Workflows (top menu/side bar) and select Edit.

  2. Add the following tools to the workflow:
    • Samtools flagstat (compute mapping statistics after BWA-MEM).
    • Filter SAM or BAM (remove reads with mapping quality MAPQ < 20).
    • Samtools flagstat (compute mapping statistics after filtering).
  3. Rename the input boxes:
    • Rename the CRN-107_11-R1.fastq.gz box to Read 1 (fastq).
    • Rename the CRN-107_11-R2.fastq.gz box to Read 2 (fastq).
    • Rename the CaptureDesign_chr4.bed box to Capture Design (bed).
  4. Save the workflow.

9 - Run a Workflow

  1. Copy the following files to a new history:
    • CRN-107_11-R1.fastq.gz
    • CRN-107_11-R2.fastq.gz
    • CaptureDesign_chr4.bed
  2. Run the DNA-seq data analysis workflow:
    • Select the appropriate input files and parameters.

  • How many reads are discarded due to low mapping quality?
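
In the workflow, the answer comes from comparing the two Samtools flagstat reports (before and after filtering). As an illustration of what the MAPQ < 20 filter does, the sketch below splits SAM alignment lines by the value in the fifth column (MAPQ). It assumes uncompressed SAM text rather than BAM, and it counts alignment lines, which can differ from the number of reads when secondary or supplementary alignments are present. The function name and the example lines are hypothetical.

```python
from typing import Iterable, Tuple


def split_by_mapq(sam_lines: Iterable[str], min_mapq: int = 20) -> Tuple[int, int]:
    """Return (kept, discarded) counts of SAM alignment lines for a MAPQ threshold."""
    kept = discarded = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        mapq = int(line.split("\t")[4])  # MAPQ is the 5th mandatory SAM column
        if mapq >= min_mapq:
            kept += 1
        else:
            discarded += 1
    return kept, discarded


# Hypothetical alignment lines, not taken from the CRN-107 data.
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr4\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t0\tchr4\t300\t5\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr4\t500\t20\t4M\t*\t0\t0\tACGT\tIIII",
]
print(split_by_mapq(example_sam))  # (2, 1)
```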